KMID : 1022420180100010039
Phonetics and Speech Sciences
2018 Volume.10 No. 1 p.39 ~ p.48
An end-to-end synthesis method for Korean text-to-speech systems
Choi Yeun-Ju
Jung Young-Moon
Kim Young-Gwan
Suh Young-Joo
Kim Hoi-Rin
Abstract
A typical statistical parametric speech synthesis (text-to-speech, TTS) system consists of separate modules, such as a text analysis module, an acoustic modeling module, and a speech synthesis module. This design causes two problems: 1) expert knowledge of each module is required, and 2) errors generated in one module accumulate as they propagate through the subsequent modules. An end-to-end TTS system can avoid these problems by synthesizing voice signals directly from an input string. In this study, we implemented an end-to-end Korean TTS system using Google's Tacotron, an end-to-end TTS system based on a sequence-to-sequence model with an attention mechanism. We used 4392 utterances spoken by a Korean female speaker, an amount that corresponds to 37% of the dataset Google used for training Tacotron. Our system obtained a mean opinion score (MOS) of 2.98 and a degradation mean opinion score (DMOS) of 3.25. We discuss the factors that affected the training of the system. Experiments demonstrate that the post-processing network must be designed with the output language and input characters in mind, and that the maximum value of n for the n-grams modeled by the encoder should be kept small enough for the amount of training data available.
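The remark about the maximum n for n-grams can be illustrated with a small sketch. In the published Tacotron architecture, the encoder contains a bank of 1-D convolutions with filter widths 1 through K, so that the filter of width k responds to k consecutive input characters (a k-gram). The code below is a minimal, pure-Python illustration of that idea, not the authors' implementation: the function names `conv1d_same` and `conv_bank`, the random filter weights, and the single filter per width are all assumptions made for brevity.

```python
import random

def conv1d_same(seq, kernel):
    """Apply one filter of width k over a sequence with 'same' zero padding.

    seq:    list of T character embeddings, each a list of D floats
    kernel: list of k weight vectors, each a list of D floats
    Returns a list of T floats, one response per input position.
    """
    k = len(kernel)
    dim = len(seq[0])
    left = (k - 1) // 2
    zero = [0.0] * dim
    # Pad so that every position has a full window of k embeddings.
    padded = [zero] * left + seq + [zero] * (k - 1 - left)
    out = []
    for t in range(len(seq)):
        window = padded[t:t + k]
        out.append(sum(w * x
                       for kv, xv in zip(kernel, window)
                       for w, x in zip(kv, xv)))
    return out

def conv_bank(seq, max_width, seed=0):
    """Run filters of width 1..max_width and stack their outputs.

    Illustrative assumption: one randomly initialised filter per width.
    A trained model would learn many filters per width.
    Returns a T x max_width list of lists.
    """
    rng = random.Random(seed)
    dim = len(seq[0])
    per_width = []
    for k in range(1, max_width + 1):
        kernel = [[rng.uniform(-1.0, 1.0) for _ in range(dim)]
                  for _ in range(k)]
        per_width.append(conv1d_same(seq, kernel))
    # Transpose from (max_width x T) to (T x max_width).
    return [list(col) for col in zip(*per_width)]

if __name__ == "__main__":
    # Five hypothetical character embeddings of dimension 4.
    rng = random.Random(1)
    chars = [[rng.uniform(-1.0, 1.0) for _ in range(4)] for _ in range(5)]
    features = conv_bank(chars, max_width=3)
    print(len(features), len(features[0]))
```

Each additional width adds filters whose parameter count grows with k, which is one plausible reading of why a smaller maximum width suits a smaller training set; the abstract itself does not state the values that were tried.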
KEYWORD
attention mechanism, end-to-end, Korean text-to-speech system, sequence-to-sequence, Tacotron
Listed journal information
Korea Research Foundation (KCI)